Introduction

The data science problem we are trying to solve is predicting whether an employee will choose to leave a company based on their profile: classifying an employee's attrition status from demographic information, work history, and job details. This problem is important because high attrition rates mean companies lose talented workers and must constantly expend resources to hire and train new employees. Using our model, a company could predict whether a given employee has a high chance of wanting to leave and incentivize that employee to stay. Additionally, noticing trends in employee attrition could allow the company to make company-wide changes to decrease the overall attrition rate.

To solve this problem, we’re using a fictional dataset created by IBM data scientists that includes HR information related to an employee’s work life, history, marital status, education, and more. Additionally, each employee has a label corresponding to the employee’s attrition. There are 34 features in this dataset, which include information such as age, gender, marital status, monthly income, and the number of years at the company. Using this dataset, we would like to be able to predict whether an employee will leave the company based on their profile.

Ordinal Attributes

Education: 1 'Below College', 2 'College', 3 'Bachelor', 4 'Master', 5 'Doctor'

EnvironmentSatisfaction: 1 'Low', 2 'Medium', 3 'High', 4 'Very High'

JobInvolvement: 1 'Low', 2 'Medium', 3 'High', 4 'Very High'

JobSatisfaction: 1 'Low', 2 'Medium', 3 'High', 4 'Very High'

PerformanceRating: 1 'Low', 2 'Good', 3 'Excellent', 4 'Outstanding'

RelationshipSatisfaction: 1 'Low', 2 'Medium', 3 'High', 4 'Very High'

WorkLifeBalance: 1 'Bad', 2 'Good', 3 'Better', 4 'Best'

A. Data prep

Checking for null values and duplicate records

We first wanted to see if there were any null values in our dataset, which we checked by finding the number of null values for each feature.

We saw that our dataset didn't contain any null values. We then dropped duplicate rows from our dataframe; since the shape of the dataframe stayed the same, there were no duplicates.
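A minimal sketch of these two checks in pandas; the real IBM CSV isn't reproduced here, so a toy frame with a few of the dataset's columns stands in:

```python
import pandas as pd

# Toy stand-in for the IBM HR dataframe (column names match the dataset).
df = pd.DataFrame({
    "Age": [41, 49, 37, 33],
    "Attrition": ["Yes", "No", "Yes", "No"],
    "MonthlyIncome": [5993, 5130, 2090, 2909],
})

# Count null values per feature.
print(df.isnull().sum())

# Drop duplicate rows; an unchanged shape means there were none.
shape_before = df.shape
df = df.drop_duplicates()
print(shape_before, "->", df.shape)
```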

Split features and labels

We then separated the column with the attrition status label from the features columns so that we could use this data with the scikit-learn models.
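Assuming the label column is named Attrition as in the dataset, the split can be sketched as:

```python
import pandas as pd

# Toy stand-in; the real frame holds all 34 features plus the Attrition label.
df = pd.DataFrame({
    "Age": [41, 49, 37],
    "MonthlyIncome": [5993, 5130, 2090],
    "Attrition": ["Yes", "No", "No"],
})

# Labels for the scikit-learn models, and everything else as features.
y = df["Attrition"]
X = df.drop(columns=["Attrition"])
```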

Remove irrelevant features

For the final piece of preparation before we explored the data, we decided to remove any columns with a single value. This is because a feature with only one possible value doesn't provide any differentiating information.

We saw that there were three features with only one value: EmployeeCount, Over18, and StandardHours. We then removed these columns from the dataframe.

'EmployeeNumber' also doesn't provide information for the employees because it's a random ID, so we removed this column.
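Both removals can be sketched as a single nunique-based filter; the column names are from the dataset, while the toy values here are hypothetical stand-ins:

```python
import pandas as pd

# Toy frame with the kinds of columns we removed.
X = pd.DataFrame({
    "Age": [41, 49, 37],
    "EmployeeCount": [1, 1, 1],     # single value everywhere
    "Over18": ["Y", "Y", "Y"],      # single value everywhere
    "StandardHours": [80, 80, 80],  # single value everywhere
    "EmployeeNumber": [1, 2, 4],    # random ID, no signal
})

# Drop columns with only one unique value, plus the ID column.
single_valued = [c for c in X.columns if X[c].nunique() == 1]
X = X.drop(columns=single_valued + ["EmployeeNumber"])
print(X.columns.tolist())
```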

B. Data Exploration

Understanding the numbers

To get a broad idea of the dataset and the values for each feature, we looked at the mean, median, and standard deviation of each feature. We also printed the number of records in each class so we could check for class imbalance.

From this, we could see that 84% of the employees in this dataset stayed with the company, resulting in a class imbalance that we would need to address when creating our model.
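A sketch of these two checks with pandas, again on a toy frame rather than the real data:

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [41, 49, 37, 33, 27, 32],
    "Attrition": ["No", "No", "No", "No", "No", "Yes"],
})

# Summary statistics (mean, std, quartiles) for each numeric feature.
print(df.describe())

# Class balance: the share of records in each attrition class.
print(df["Attrition"].value_counts(normalize=True))
```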

Look at histograms of data

To get a better picture of the data, we generated histograms for each feature, showing the number of records with each possible feature value.

Look at Correlations

We decided to generate a correlation matrix for the features. We thought we'd be able to remove some of the features regarding each employee's rate (DailyRate, HourlyRate, MonthlyRate), since it seemed that these features would contain the same information at different time scales. We also thought there might be correlation between each employee's monthly income and monthly rate. To confirm these suspicions, we wanted to look at the correlations between the different features.

After checking the correlation heatmap, we saw that the different rate features were not correlated with each other, nor were they correlated with monthly income. Based on this information, we decided to keep the different rate features.

We also noticed that JobLevel and MonthlyIncome are highly correlated. However, because JobLevel differentiates tiers amongst employees with the same job roles (JobRole), we decided not to drop either feature.
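The correlation check can be sketched as follows; the numbers come from a toy frame chosen to mirror the JobLevel/MonthlyIncome relationship, not from the real data:

```python
import pandas as pd

# Toy numeric frame; on the real data we pass the full feature frame.
df = pd.DataFrame({
    "MonthlyIncome": [5993, 5130, 2090, 3468, 9526],
    "JobLevel": [2, 2, 1, 1, 3],
    "HourlyRate": [94, 61, 92, 56, 40],
})

# Pairwise Pearson correlations between the numeric features.
corr = df.corr()
print(corr.round(2))
```

For the heatmap itself, seaborn's `sns.heatmap(corr, annot=True)` is one common way to render this matrix.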

Graphs for Attrition vs. Features

We then wanted to look at the relationship between different feature values and attrition rates. We divided the features up into five feature groups: satisfaction, job specifics, demographics, income, and work history with the current company.

Satisfaction vs. Attrition

For the satisfaction category, we looked at the following features: WorkLifeBalance, RelationshipSatisfaction, JobSatisfaction, EnvironmentSatisfaction, and OverTime. For each feature, we graphed the number of records with each value by attrition status.

We could see there is a much higher attrition rate in the population of employees who had to work overtime compared to the population of employees without overtime. As expected, we could also see that there are higher rates of attrition in the populations of employees with lower environment satisfaction, worse work-life balance, and lower job satisfaction.


Job Specifics vs. Attrition

The next category of features we looked at related to details about each employee's job: JobLevel, JobInvolvement, BusinessTravel, StockOptionLevel, NumCompaniesWorked, Department, and JobRole. We again graphed the number of records with each feature value by attrition status. For NumCompaniesWorked, we generated a Kernel Density Estimate (KDE) plot, since it has a wider range of possible values.

We could see the highest attrition rates in the employee populations with the lowest job level, lowest job involvement, and frequent travel. We could also see high attrition rates in the sales representative role.


Demographics vs. Attrition

The next category of features that we looked at was demographics, which included the MaritalStatus, Gender, EducationField, Education, DistanceFromHome, and Age features. We generated KDE plots for DistanceFromHome and Age, and we plotted the other features by value and attrition status.

We could see that the population of single employees had a higher rate of attrition compared to married and divorced employees. There was also a higher proportion of employees who left the company in the population that had a farther distance from home compared to the population that was closer.


Work history with company

We then looked at the features related to each employee's work history with the company: TrainingTimesLastYear, TotalWorkingYears, YearsAtCompany, YearsInCurrentRole, YearsSinceLastPromotion, and YearsWithCurrManager. We did this by creating paired box plots for each feature with TrainingTimesLastYear (since TrainingTimesLastYear is the only categorical feature in this group), separating the plots by attrition status.

From these plots, we can see that employees who left the company tended to have fewer years in their current role and with their current manager.


Income vs. Attrition

The last group of features we looked at was related to income and rates. This included PercentSalaryHike, MonthlyRate, HourlyRate, DailyRate, MonthlyIncome, and PerformanceRating. We created paired box plots for each feature with PerformanceRating (since PerformanceRating is the only categorical feature in this group), separating the plots by attrition status.

The box plots show no outliers for the income features with respect to PerformanceRating, with the exception of MonthlyIncome, where outliers appear due to income gaps between employees. We also generated scatter plots to check for outliers in terms of clustering; from visual inspection, there are no outliers in the data for the features compared in these graphs.

C. Data Engineering

One-hot encoding

We first one-hot encoded the non-numeric features, which included the following features: BusinessTravel, Department, EducationField, JobRole, and MaritalStatus.
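A sketch with pandas get_dummies, one common way to one-hot encode (the original code isn't reproduced, and only a subset of the categorical columns is shown):

```python
import pandas as pd

X = pd.DataFrame({
    "Age": [41, 49, 37],
    "MaritalStatus": ["Single", "Married", "Divorced"],
    "Department": ["Sales", "Research & Development", "Sales"],
})

# get_dummies expands each non-numeric column into one 0/1 column per category.
X = pd.get_dummies(X, columns=["MaritalStatus", "Department"])
print(X.columns.tolist())
```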

Feature creation

We decided to agglomerate the three satisfaction features (EnvironmentSatisfaction, JobSatisfaction, RelationshipSatisfaction) into one overall satisfaction index. We did this by replacing the three columns with a new column that contained the sum of those columns. We did this because it reduced the number of features in our dataset while still giving us a measure for each employee's overall satisfaction.
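The aggregation can be sketched as a row-wise sum followed by dropping the source columns (toy values; the new column name matches the OverallSatisfaction feature discussed later):

```python
import pandas as pd

X = pd.DataFrame({
    "EnvironmentSatisfaction": [3, 1, 4],
    "JobSatisfaction": [4, 2, 4],
    "RelationshipSatisfaction": [1, 2, 3],
})

satisfaction_cols = [
    "EnvironmentSatisfaction", "JobSatisfaction", "RelationshipSatisfaction",
]
# Replace the three columns with their sum as a single satisfaction index.
X["OverallSatisfaction"] = X[satisfaction_cols].sum(axis=1)
X = X.drop(columns=satisfaction_cols)
```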

Binning

We binned certain features to reduce runtime for the models. 'Age' was binned because employees in certain age groups are likely to share similar traits. We binned employees into the following age groups: '18-24', '25-34', '35-44', '45-55', '56+'.
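A hedged sketch of the age binning with pd.cut; the exact bin edges are our reading of the groups above, not the original code:

```python
import pandas as pd

X = pd.DataFrame({"Age": [19, 28, 41, 47, 60]})

# pd.cut assigns each age to a labeled, right-inclusive bin.
bins = [17, 24, 34, 44, 55, 100]
labels = ["18-24", "25-34", "35-44", "45-55", "56+"]
X["AgeGroup"] = pd.cut(X["Age"], bins=bins, labels=labels)
print(X["AgeGroup"].tolist())
```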

The rate and income features were binned as well, since our earlier analysis showed they provided too little information to justify a higher resolution.

D. Model Experimentation

We chose to test the following classification models: Naive Bayes, K-Nearest Neighbors, Multi-layer Perceptron, Support Vector Machine, SGDClassifier, Decision Tree, Random Forest, and AdaBoost.

During our data exploration, we noticed a class imbalance: there were 1,233 records for employees who did not leave the company, but only 237 for employees who did. To address this imbalance, we also ran each model with the Synthetic Minority Oversampling Technique (SMOTE) from the imblearn package, which creates synthetic minority-class records, and evaluated the changes in accuracy, recall, and precision.

Naive Bayes

We tested Naive Bayes since it doesn't suffer from the curse of dimensionality.

Overall, Naive Bayes struggled with recall for employees who did not leave and with precision for those who did leave IBM, leading to a low overall accuracy on the dataset.

K Nearest Neighbors

We then tested the dataset with a K-Nearest Neighbors (KNN) classifier.

KNN performed well with respect to precision and recall for employees who did not leave IBM. For employees who did leave, the precision score was much lower, and the recall score was also very low, making this classifier poor at identifying those who will actually leave the company (Type II error).

Multi-layer Perceptron

We tested the dataset using a Multi-layer Perceptron (MLP) classifier. We chose this model because MLPs can learn complex and diverse decision boundaries from the data.

As expected, the MLP classifier obtains high precision and recall for employees who do not leave IBM. However, like most of the classifiers we will look at, it performs poorly on recall for employees who do leave IBM.

Support Vector Machine

We tested the dataset using a Support Vector Machine (SVM) classifier. We expected it to perform well on our dataset compared to classifiers such as Naive Bayes and KNN due to SVM's effectiveness with high-dimensional datasets.

The SVM classifier obtained high precision and recall for employees who did not leave IBM, as well as high precision for those who did leave. However, like many of the models tested, it did poorly at identifying who will actually leave IBM, with a low recall score (high Type II error) for that class.

Stochastic Gradient Descent Classifier

SGDClassifier is a generalized linear classifier that uses Stochastic Gradient Descent as its solver. It supports many different loss functions, which lets you tune the model and find the best fit for your data. Since SVM performed well, we expected SGDClassifier to perform similarly.
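Swapping loss functions in SGDClassifier can be sketched as follows; the data is a stand-in and the two losses shown are illustrative, not necessarily the ones tried in the project:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# Stand-in data; the real project uses the engineered attrition features.
X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Each loss yields a different linear model: "hinge" approximates a linear SVM,
# "modified_huber" is a smoothed variant that also supports probabilities.
scores = {}
for loss in ["hinge", "modified_huber"]:
    clf = SGDClassifier(loss=loss, random_state=0).fit(X_tr, y_tr)
    scores[loss] = clf.score(X_te, y_te)
print(scores)
```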

SGDClassifier's results place it among the top-performing models based on its precision and recall for employees who do not leave IBM, but it still performs poorly at identifying employees who actually leave, judging by that class's recall and precision scores.

Decision Tree

We tested the data with a Decision Tree classifier since it is one of the most interpretable classification models.

To put the scores in context, always predicting 'No' for this dataset would already achieve an accuracy of 83.8%. With that baseline in mind, the Decision Tree classifier performs poorly, as it doesn't compensate well for the 'Yes' class. This could be because the data has few features that strongly distinguish those who leave IBM from those who don't, making it hard for the Decision Tree to determine good features to split on.

Visualizing Decision Tree

To understand why the Decision Tree Classifier performs poorly, we wrote the following code block to visualize the classifier.
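The original notebook code isn't reproduced here; a minimal reconstruction using scikit-learn's plot_tree on stand-in data might look like:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for saving figures
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Stand-in data; in the project the fitted attrition tree is plotted instead.
X, y = make_classification(n_samples=200, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Each node shows its split, Gini impurity, sample count, and class counts.
fig, ax = plt.subplots(figsize=(12, 6))
plot_tree(tree, class_names=["No", "Yes"], filled=True, ax=ax)
fig.savefig("tree.png")
```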

The decision tree classifier relies on an impurity metric called Gini impurity: the tree splits on the feature that yields the largest information gain, or equivalently the largest decrease in Gini impurity from before the split to after it. In other words, we want purer subsets, ideally composed entirely of attritions or entirely of non-attritions. Looking at the image, we see that most splits actually produce very small reductions in Gini impurity, or even increase it. This is further evidence that our decision tree struggles to find features that distinguish employees who left the company from those who didn't. In fact, examining the leaves (blue) that predict 'No' attrition, their Gini values are very close to 0.5, which means the tree is essentially guessing, with 'Yes' and 'No' equally likely despite the prior splits.

Random Forest

Random Forest classifiers are collections of Decision Trees with randomness added to how each tree's splits are determined. The voting scheme lets the classifier capture more complex patterns in the data, and the stochastic factor helps reduce overfitting compared to a single Decision Tree.

Random Forest performed better than the Decision Tree; however, the tradeoff in interpretability and training time for the small performance gain is disappointing.

Visualizing feature importances

To understand each feature's contribution to the Random Forest classifier's predictions, we wrote the following code block to graph the importances obtained from the classifier.
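Again, the original code block isn't reproduced; a sketch using the classifier's feature_importances_ attribute, with stand-in data and a hypothetical subset of the feature names, could look like:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for saving figures
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in data; the project uses the fitted attrition forest and its features.
names = ["OverTime", "TotalWorkingYears", "OverallSatisfaction", "YearsAtCompany"]
X, y = make_classification(n_samples=300, n_features=4, random_state=0)
forest = RandomForestClassifier(random_state=0).fit(X, y)

# Importances sum to 1; sort so the bar chart reads smallest to largest.
importances = pd.Series(forest.feature_importances_, index=names).sort_values()
ax = importances.plot.barh()
ax.figure.savefig("importances.png")
```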

From the graph, we learn that 'OverTime', 'TotalWorkingYears', 'OverallSatisfaction', 'YearsAtCompany', and 'Age' are the features the Random Forest classifier treats as most important. The graph also reveals how importance varies across job roles, with the 'Sales Representative' role many times more important than 'Research Director'. This is consistent with our earlier graphs of job role versus attrition count.

AdaBoost

AdaBoost uses the boosting strategy to improve a classifier's score. We applied AdaBoost to the three classifiers that performed best above: MLP, SGD, and SVM.

As expected, AdaBoost did increase some metrics, such as recall for 'No', which rose all the way to 100%. However, this came at the cost of recall for 'Yes', which dropped to 8%, the worst of all the classifiers. Overall, AdaBoost improves the other metrics at the cost of recall for 'Yes'. Unfortunately, that is the metric we care about most, since we want to predict which employees will leave and avoid misclassifying them.

E. Conclusion

Plotting ROC

We made three figures to summarize the results across all of our classifiers. The first plots the ROC curves for each classifier, with and without SMOTE, along with their AUC values.

AUC is the area under the Receiver Operating Characteristic (ROC) curve, which plots the true positive rate against the false positive rate at the various probability thresholds used to decide whether an employee has attrited. The closer the curve is to the upper-left corner, the better the model distinguishes attrited from non-attrited employees, and the higher the AUC. Hence, we plotted the ROC curves for all of our classifiers on the same axes so that the best classifier by this criterion is easy to spot.
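The per-classifier ROC/AUC computation can be sketched as follows, with stand-in data and a stand-in model in place of our fitted classifiers:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Stand-in data with an 84/16 class split, echoing the attrition imbalance.
X, y = make_classification(n_samples=600, weights=[0.84], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

# ROC needs a score per example rather than a hard label, so use predict_proba.
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
probs = clf.predict_proba(X_te)[:, 1]

# One (fpr, tpr) point per threshold; plotting fpr vs tpr gives the ROC curve.
fpr, tpr, thresholds = roc_curve(y_te, probs)
print("AUC:", round(roc_auc_score(y_te, probs), 3))
```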

From the graph, we see that MLP has the highest performance with the highest AUC value of .85 while DecisionTree with and without SMOTE had the lowest AUC values of .69 and .62, respectively.

The second is a grouped bar chart comparing various metrics across each model, namely accuracy, precision, recall, F1 score, and Area Under the Curve (AUC). Precision, the number of true positives over the sum of true positives and false positives, answers the question: given a positive prediction from the classifier, how likely is it to be correct? Recall, the number of true positives over the sum of true positives and false negatives, answers: given a positive example, will our classifier detect it? Because of the precision-recall tradeoff, we can instead compare classifiers using the F1 score, the harmonic mean of precision and recall. Finally, AUC measures the degree of separability: the higher the AUC, the higher the true positive and true negative rates (and thus the better the performance). We want to maximize all of these metrics, so we can directly compare bar heights to find the best classifier within each metric.

Slope-graph

To evaluate how SMOTE affected our results, we created a slopegraph to highlight the change across our evaluation metrics.

Selecting a classifier

Based on the slope-graph, the Naive Bayes classifier has the highest recall for employees who did leave IBM, although it performs poorly on the other metrics compared to some of the other classifiers. Despite this, if we were choosing a classifier to best predict whether an employee will leave, Naive Bayes would be the one.

If we were using accuracy as the sole metric for selecting a classifier, MLP would be the best according to the grouped bar graph. However, MLP suffers from low recall for the 'Yes' class and large variances in its precision, recall, and F1 scores, which led us to pick SGD and SVM as the best overall classifiers: they do not show the same variation in scores and still provide accuracy very close to MLP's.

SMOTE vs Non-SMOTE

The comparison between classifiers with and without SMOTE suggests that SMOTE may not work well for high-dimensional data. Recall for the minority class increases slightly, but at the cost of minority-class precision, while recall for the majority class drops drastically. SMOTE does not fully solve the class imbalance problem; future work would include finding other methods to handle class imbalance in high-dimensional data.